Humor appreciation can be predicted with machine learning techniques

Humor research is supposed to predict whether something is funny. According to its theories and observations, amusement should be predictable based on a wide variety of variables. We test the practical value of humor appreciation research in terms of prediction accuracy. We find that machine learning methods (boosted decision trees) can indeed predict humor appreciation with an accuracy close to its theoretical ceiling. However, individual demographic and psychological variables, while replicating previous statistical findings, offer only negligible gains in accuracy. Successful predictions require previous ratings by the same rater, unless highly specific interactions between rater and joke content can be assessed. We discuss implications for humor research, and offer advice for practitioners designing content recommendations engines or entertainment platforms, as well as other research fields aiming to review their practical usefulness.


Predictors of amusement
Factors that influence whether someone will be amused by a humor stimulus can be divided into different categories.One of these categories is the nature of the joke, including its content and structural qualities 4,6 .Early theories suggested that jokes with certain contents are funnier than others because they tie in more closely with our supposed reasons for laughing.For instance, Freud suggested a cathartic function of jokes and laughter, as a valve for releasing tension from societally suppressed drives towards sex and aggression 7 .Thus, a hypothesis was formed (and sometimes supported 8 ) that sexual or aggressive humor stimuli are funnier than others.In fact, even earlier theorizing suggests that all instances of humor consist of a satisfactory domination of others 9,10 .Other theories posit that people are more likely to laugh, when a joke elicits a strong but manageable cognitive violation (for a distinction with surprisingness/incongruity, see 11 ) or when a joke has a strong elaboration potential (i.e., many funny implications 12 ).To zoom out, such theories state that some jokes are simply funnier than others; they imply that to predict amusement, one should evaluate how good the respective joke is.
There is parallel stream of empirical evidence, sometimes coming from the same groups of researchers, showing that characteristics of the audience members are also predictive of humor appreciation.In short, some people will laugh at most jokes, whereas others remain hard to amuse 13 .One of the most differential traits when it comes to humor appreciation is a person's dispositional cheerfulness, the temperamental basis for humor, often used

Predictive accuracy as practical relevance
Traditionally, psychological sciences focus on offering coherent explanations for human behavior.Usually, this is achieved by accumulating and mapping empirical observations to causal theories.Many such theories imply patterns of statistical associations between two or more relevant measurements.While theories often state that one measure should 'predict' the other (i.e., a non-zero statistical association), there is usually no pre-defined cut-off for how accurate this prediction should be to support the theory.
For example, various findings suggest that bad quality of sleep is associated with depression and suicide 30 .A theorist might wonder about the mechanism and directionality of the association.Conversely, a practitioner might wonder whether it is now feasible to build the researchers' statistical model into their fitness tracker app to enable an "early-warning" feature for poor sleepers 31 .If the model can accurately predict depression risk for new participants, one could even consider a direct alert going out to local health care professionals, similar to car crash detection in mobile phones.However, most analyses in psychological research do not inform practitioners about the predictive accuracy that their theory affords 32 .Integrating machine learning models and, more importantly, out-of-sample accuracy assessments would therefore constitute a clear step from theoretical to practical usefulness 33 .
Regarding humor, most machine learning systems are focused on humor detection, meaning that models automatically infer which texts or images were meant to convey humor [34][35][36][37] .Interestingly, state-of-the-art performance in humor detection is achieved by models that are designed based on psychological humor theories 38 .While these research efforts do not focus on humor appreciation, they highlight that computational and psychological research can mutually reinforce each other when analyzing humor stimuli.Some research also exists on machine predictions of humor appreciation.Specifically, models are trained to predict which of two humor stimuli will be rated as funnier, when aggregating ratings of multiple participants 39 .The efforts are often only partially successful meaning models make many mistakes in ranking the funniness of humor stimuli 40,41 .
In this work, we change the level of analysis to pairings of individual raters with individual stimuli (i.e., no aggregation).Ruch reviewed the stark effect of the aggregation choice on later results (e.g., meaning and magnitude of correlations between funniness ratings and auxiliary variables 42 ).Generally, aggregation boosts reliability and thus our focus on single experiences of humor by single individuals makes prediction more difficult.However, a compensatory advantage is that individual-level characteristics of both, the stimulus and the rater, can be used for the prediction of amusement.
The benefit of reviewing the predictive power of humor research is that practitioners can use the new insights to build content recommendation websites, revise their comedy routine, or forecast the success rate of their clinical humor treatments.Generally, it gives insight into the practical usefulness of the many variables that appear in recent humor theories.

Study 1
We assess whether prominent constructs from humor research can accurately predict humor appreciation and which specific constructs offer the largest gains in empirical accuracy.All participants provided informed consent.The study procedure in accordance with the APA ethics code as well as the Declaration of Helsinki, and was approved by the Faculty Ethics Review Board of the first author's research institution under leadership of Lourens Waldorp.

Procedure
In an initial survey, participants filled out a questionnaire assessing constructs relating to personality, morality, ability, humor attitudes, and demographic data.Two days later, they received an invitation to a second survey which assessed their current mood as well as humor appreciation for 10 stimuli.It was further asked how well participants understood the English texts of the humor stimuli.All participants passed an attention check asking for the sum of eight and two.The data from the initial study was merged with the second survey using the participants' unique user ID on the platform.

Materials
When trying to estimate the predictive value of psychological research it is of central importance to select promising predictor variables.All constructs and variables outlined below were selected based on their prominence in the literature (cf., sources above, and per measure below).Some noteworthy omissions are discussed in the limitation section.We relied on published short versions of psychological scales.All stimuli and newly generated questions about people's attitude towards the content of the humor stimuli are included in the Supplementary materials.
Big Five.Various meta-analyses and reviews of the relationship between the Big Five personality traits and humor have been conducted 25,26 .The Big Five Inventory 10 (BFI-10) is a ten-item version of The Big Five Inventory (BFI) developed by Rammstedt and John 43 and measures agreeableness, extraversion, openness to experience, conscientiousness, and neuroticism.Answers are given on a 5-point scale ("Disagree strongly" to "Agree strongly").The scale authors report an average six-to-eight-week retest-reliability of 0.75.
Cheerfulness.In humor appreciation, state cheerfulness is one of the central constructs as it forms part of the temperamental basis for humor 44 .The State-Trait Cheerfulness Inventory State Version-Short Form (STCI-S18 45 ) is a short version of the State-Trait Cheerfulness Inventory State Version (STCI 16 ).It is an 18-item survey that measures cheerfulness on three dimensions: cheerfulness, seriousness, and bad mood.Because mood is already measured separately, only the 6 cheerfulness items were used here.The answers to these items were given on a 4-point scale ("Strongly disagree" to "Strongly agree").The authors report a four-to-five-week retestreliability of 0.85.The Cronbach's alpha coefficient in the current study was 0.77.Moral identity.In humor research, a negative effect of morality on humor appreciation was observed by various groups [46][47][48] for at least some forms of humor.The five-item internalization dimension of the Moral Identity Scale 49 was developed to measure moral beliefs.The participant is asked to imagine a person that is "caring, compassionate, fair, friendly, generous, helpful, hardworking, honest, kind".After this, five identification items are answered on a 5-point scale ("Strongly disagree" to "Strongly agree").The original authors report a retestreliability of 0.49 over a four-to-six-week time period, arguing that the construct varies over time.The Cronbach's alpha coefficient in the current study was 0.7.
Sensation seeking.While the relationship between dispositional sensation-seeking and humor appreciation is sometimes positive and sometimes negative (depending on the nature of the humor stimuli) it remains a potent predictor of humor appreciation across cultures 23 .The Brief Sensation Seeking Scale (BSSS 50 ) is an eight-item survey created to measure the components of sensation seeking.The answers to the BSSS are given on a 5-point scale ("Strongly disagree" to "Strongly agree").The original authors of the scale did not report a retest-reliability but a Dutch version achieved a two-week retest reliability of 0.93 51 .The Cronbach's alpha coefficient in the current study was 0.81.
Humor production.Moran and colleagues 20 found that people that were better at producing humor, were less appreciative of humor stimuli than others (see 19 for a set of opposing results).The Multidimensional sense of humor scale (MSHS 52 ) assesses respondents' self-rated skill in being funny.The relevant subscale encompasses four items.Answers are given on a 5-point scale ("Strongly disagree" to "Strongly agree").The original authors did not report a retest-reliability, but a Finnish version showed a three-year retest reliability for the subscale of 0.75 53 .The Cronbach's alpha coefficient in the current study was 0.75.www.nature.com/scientificreports/Mood.It is an ongoing question which mood facets influence humor appreciation (the most 13,15,54 ).The International Positive and Negative Affect Schedule Short Form (I-PANAS-SF 55 ) is a short version of the Positive and Negative Affect Schedule (PANAS 56 ).It is a 10-item measurement in which people rate how they generally feel (e.g., "upset", "nervous") on a 5-point scale ("never" to "always").The authors report an eight-week retestreliability of 0.85.
Stimulus-specific attitudes.Broad traits, as listed above, are prominent in humor research, but narrow constructs can be highly predictive as well if they fit the context well.In the case of humor appreciation this pertains specifically to people's attitudes towards specific joke contents 57 .Knowing whether someone is anti-Trump will likely be informative for their amusement from anti-Trump jokes, albeit less informative for broader humor categories.Here, we assess stimulus-specific attitudes with newly developed questions about the specific humor stimuli displayed in the survey Examples of these are "I like pictures of cute animals" or "I am very familiar with Game of Thrones".The respective jokes included a humorous picture of a dog and a meme featuring a scene from Game of Thrones.Attitude items were rated on a 5-point scale from ("Strongly disagree" to "Strongly agree").
Humor appreciation.The prediction target, humor appreciation, was measured for ten different humor stimuli with two items per stimulus.These stimuli were selected from a previously published corpus of 105 diverse humor stimuli 27 .The selected humor stimuli were categorized by the original authors as falling into five categories (there were more in the corpus): affiliative, self-defeating, aggressive, self-enhancing, and sexual 58,59 .For the current study, one textual and one image stimulus were selected from each category.Note that the assignment of the stimuli into one of the five categories is not objective.The original authors write "Note that most stimuli fell into multiple categories […] because everyday humor usually combines different dimensions" ( 27 , p. 1390).Table 1 shows an example stimulus for each of the five categories.Note, for instance, that jokes which weren't assigned to the category 'aggressive' still vary in their aggressiveness.
Here, humor appreciation was measured with two questions: "How funny do you find this {text/image}?"(1 = "not funny at all" to 7 = "very funny" 27,48,57 , and "Final evaluation:" with answer options "Good joke" and "Bad joke").We selected a continuous and a binary measure to have complementary accuracy metrics available during the out-of-sample accuracy analysis.

Sample
A German panel provider offered participation through their mobile application where anonymous users can generate and distribute their own opinion polls, as well as answer polls of others.The app includes gamified rewards such as levels and coins that can be gathered by generating and participating in surveys.Once a certain number of coins is reached, participants can donate 10 Euro to protect the rainforest or get 10 Euro transferred to their PayPal account.However, the main driver for participation is entertainment rather than monetary rewards, as well as the opportunity for anyone to conduct representative polls outside one's own social bubble 60 .Participants from Germany (58.87%), the UK (24.79%), and the USA (16.33%) entered the study for a total of 5473 responders (50.23% male, 49.77% female).Ages ranged from 18 to 84 (M age = 32.68,SD age = 10.76).All participants indicated in a previous assessment that they can read and speak English and passed an English attention check question.

Analysis plan
The ultimate goal of the current work is to assess how accurately one can predict humor appreciation on unseen cases, and which predictors are the most useful in that regard.Thus, the focal pieces of the current study are prediction models relating humor appreciation to a range of predictors from psychological research.Given the tabular nature of the data, we utilized boosted decision trees (XGboost 61 ) as the prediction model (ensemble).The XGboost algorithm is currently considered the state-of-the-art technique for predicting target variables based on tabular data 62 .Tabular data refers to the data commonly found in social science investigations with multiple observations scored on qualitatively distinct variables which can be organized in a table-like format.Other collections of data include images, audio, or text, for which other approaches, most notably neural networks, trump the performance of tree-based algorithms like XGboost (for a review of both approaches see 63 ).
Much like in OLS regression models, the XGboost algorithm relates predictor variables to an outcome variable, and its inner parameters are optimized for prediction accuracy.However, rather than fitting a linear combination of predictors, XGboost is based on decision trees that funnel observations down different "branches" based on logical statements (e.g., "Age > 30" goes left, other datapoints go right).These splits are optimized to www.nature.com/scientificreports/create end nodes (i.e., "leaves") on which data points have similar scores on the outcome variable.The XGboost algorithm then iteratively introduces new trees which predict the residuals of previous trees rather than the raw target scores.Thereby, the predictions of the summed tree outputs approximate the target score while actively combating mispredictions.The added complexity of the algorithm usually results in substantial improvements of prediction accuracy compared to OLS regression.For details and implementation guidance, see González and colleagues 64 .
The entire pipeline of data transformation, model tuning, and final analysis was conducted separately for the continuous measure of humor appreciation (metrics: R 2 , RMSE) and the binary measure (metrics: AUC, accuracy).Data was transformed into a long format with a single response per row (54,730 rows).One response per participant was randomly selected to form part of the test set which was only accessed once during final evaluation (5473 rows).We used tidymodels and its associated packages 65 in R 66 for both tuning and evaluating the models.We used tenfold cross-validation to assess the performance of 42 hyperparameter settings and selected the best performing setting to fit the model on the full training data.We compute variable importance and final performance on the test set.Variable importance scores were generated by permutating the values for individual predictor variables and quantifying associated model performance drops in the test set.The generation of all results was done in the same way but separately for three sets of predictors: (1) all psychological constructs (including demographic info), (2) average ratings of respectively the current joke and of the current participant, and (3) the combination of both of these predictor sets.Responses in the test set were excluded when computing rating aggregates.All data and scripts are in the Supplementary materials.We compare model accuracies by inspecting 95% bootstrap confidence intervals.Formal tests for comparing the predictor sets were not necessary given the width and separation of intervals (see below).

Results
Distributions of humor appreciation for the binary measure (overall: 56.96% "good joke") and the continuous measure (overall: M = 4.34, SD = 1.92) for each of the 10 stimuli are depicted in Fig. 1.
Hyperparameter settings for the XGboost algorithms were selected based on cross-validation performance and can be reproduced with the supplementary scripts.Table 2 shows all accuracy metrics of each model when predicting humor appreciation.
The Supplementary material includes scripts and results for an alternative predictor setup, where individual survey items are used as predictors instead of the underlying psychological scales.The performance is virtually the same as shown here.
Variable importance scores (see Fig. 2) were quantified as average performance drops on the test set when randomly permutating predictor variables (100 iterations).
It is apparent that the model rests primarily on how other people rated the current joke, and the current rater's appreciation of other jokes.Scrambling any of the other psychological constructs and variables leads to negligible deviations in model performance.If one assumes that participants responded to all the jokes equally, were not indicating their authentic appreciation of the stimuli, their responses could artificially inflate the predictive power www.nature.com/scientificreports/ of past ratings.However, after removing this group of 249 people (4.5%), and retraining the models, predictions were still primarily based on past responses.Figure 3 shows the zero-order associations between individual psychological variables from psychological research and the separate funniness rating for all 10 stimuli.

Discussion
The accuracy with which humor appreciation could be forecasted was around 70% and R 2 = 0.36.While this is substantially better than chance, predictions for individual people are thus far from reliable even when considering a wide variety of predictors from psychological research.While we calculated lower accuracy limits by establishing guessing baselines, we could not quantify a higher limit that accounts for random (i.e., non-predictable) noise in the humor appreciation scores.Study 2 quantifies this limit as the retest reliability of the predicted variable.This will clarify whether the remaining errors are due to weak predictors or unreliable measurements of humor appreciation.
A positive take-away for practitioners is that simple aggregates of a user's past behavior on the platform (i.e., here operationalized as funniness ratings) were almost sufficient to achieve the maximal model performance with only small incremental gains coming from the assessment of psychological constructs.Before concluding that much psychological work has little incremental value for the prediction of humor appreciation there is an important caveat to consider; many psychological constructs were previously highlighted to matter because they interact with the nature of the material when steering humor appreciation.While the nomenclature differs, various research groups have found person-specific differentiation of stimuli into aggressive vs affiliative 58 , violating vs non-violating 28 , or adaptive vs maladaptive humor 67 .The resulting humor appreciation depends on a person's unique appraisal of these attributes.In Study 1, we did not actively encode such rater X stimulus interaction terms warranting an extended analysis for predictor importance in Study 2.
Bi-variate analyses in Study 1 replicated much previous work.For instance, cheerfulness and mood were consistent predictors of higher humor appreciation 68 .Further, we also observe lower appreciation scores for people scoring high on morality 48 .A more surprising observation was that people finding the English language in the materials generally harder to understand, rated jokes funnier.This is counterintuitive as understanding (or "getting") the joke is usually a strong predictor for finding it funny 69 .We assume that the current combination of sample (mostly native/fluent English speakers) and materials (no challenging vocabulary) led to virtually everyone understanding the jokes without any problems (median = 4; 5-point scale).Thus, the variation in selfindicated understanding could be mostly due to other factors like self-ascribed intellect.This surprising finding is followed up on in Study 2.

Study 2: Pre-registered follow-up
In Study 2, we aim to address three questions remaining after Study 1: -Can we predict humor appreciation more accurately by including rater X stimulus interactions?-What is the reliability of the humor appreciation scores (i.e., the maximally achievable prediction accuracy)?-Do we replicate the findings of Study 1 with new stimuli (including the finding that lower understanding of the stimuli is associated with higher stimulus appreciation)?
To that end, we conduct a pre-registered replication of Study 1, including different participants and humor stimuli.Further, we extend the measures to include interactions between rater and stimulus characteristics, Lastly, we estimate the reliability of humor appreciation scores through a test-retest procedure.

Method
The procedure, assessed constructs (Cronbach's alpha deviations from first study < 0.02), and analysis plan stayed the same as in Study 1 with four exceptions: 1. We used 10 new humor stimuli from the corpus of jokes from Rosenbusch and colleagues 27 .All stimuli are available in the Supplementary materials.Again, we picked one textual and one image joke from each of the pre-annotated categories of self-defeating, affiliative, aggressive, self-enhancing and sexual humor to ensure stimulus diversity 58,59 .Table 3 shows the text stimuli.2. The general item "How hard was it for you to understand the English language/vocabulary in the jokes?" was replaced with 10 stimulus-specific items: "I understood this joke" (1 = strongly disagree, 5 = strongly www.nature.com/scientificreports/agree).This was implemented to shed light on the observation from Study 1 that generally poorer language understanding was associated with more humor appreciation overall.3. Two Likert-items per humor stimulus were added for each stimulus ("The joke seems aggressive" and "The joke seems friendly"; 1 = strongly disagree, 5 = strongly agree).These will be used to quantify the interaction between the nature of specific stimuli and the unique perception of each rater.

Category Stimulus
Affiliative "Why did the bee marry?He's finally found his honey" Self-defeating "No more self-deprecating jokes, I whisper fatly" Aggressive "Light travels faster than sound.This is why some people appear bright until you hear them speak" Self-enhancing "My calculator stopped working mid way through my exam.I can't count on it anymore" Sexual "Do you need a stud in your life?Cause I got the STD and all I need is U"  www.nature.com/scientificreports/scale-based model (including these variables) now surpasses the performance of the purely "past ratings"-based model.
Figure 7 shows the zero-order associations between individual variables from psychological research and the separate funniness rating for all 10 stimuli.
In Study 2, where understanding of jokes was measured separately for each stimulus, the association between understanding and appreciating a joke is positive.This addresses the previously paradoxical finding that more (general) understanding predicts less (general) appreciation (cf., Study 1).

General discussion
Imagine you just thought of a fantastic joke.There are 100 contacts in your phone, but you are usure who to tell the joke to.Unbeknownst to you, 60 of them would laugh and 40 would not.Thus, if you told the joke to a randomly selected person, you would have a 60:40 chance of success.Applying much of the psychological research on humor appreciation from the last decades would boost this chance to about 76:24, a noticeable improvement but far from certainty.An obvious catch is that one needs to closely analyze specific characteristics of your contacts to achieve this improvement.Specifically, our results highlight the predictive value of past amusement proclivity, above all other constructs.The strongest "laugher" among your contacts is likely a safe bet.The next best class of predictors relate to the interaction between the specific joke content and the specific receiver characteristics including whether the person is likely to understand the joke or consider it overly malicious.
One reason for imperfect model accuracies appears to be the difficulty of measuring, rather than predicting, individual instances of amusement reliably.People's average reaction to humor stimuli can be measured with higher reliability but predicting these scores would disallow the usage of stimuli-level predictor variables that were of interest here (for measurement of humor appreciation, see 70 ; for response aggregation, see 42 ).Without reliable amusement responses, one cannot obtain flawless predictions for individual reactions.Situational fluctuations of amusement are well-known in the humor literature and integrating state constructs like a user's current rating tendency pushed the models to (and sometimes beyond) the outcome's temporal stability.To which degree the remaining errors are due to suboptimal model performance vs. poor measurement in the validation data remains uncertain and is methodologically challenging to address 71 .
The high predictive value of rating aggregates is encouraging for practitioners aiming to build humor applications and tools.For instance, recommendation engines that gradually collect user behaviors appear more promising than systems relying on tests of user personality.Further, many collaborative and content-based recommendation engines are designed to slowly capture optimal combinations of user and content characteristics.Based on the current research, integrating user tendencies with user-by-content interaction features is optimal to achieve a high recommendation performance.This is in line with findings achieved in a variance-decomposition approach 27 .Note that most humor appreciation theories acknowledge such interactivity in some way.For instance, the Benign Violation Theory forwards that person-specific perceptions of stimuli's aggressiveness and benignity lie at the core of amusement 28 .
The funniness ratings used in the current work are similar to explicit content ratings requested by many online entertainment platforms.Importantly, many platforms also use implicit measures to assess user preferences www.nature.com/scientificreports/(e.g., searches, view time).Measures of laughter or smiling, the closest behavioral companions to amusement, are usually not available to content providers.These behaviors are also rare and difficult to interpret in solitary settings 72 .Due to its social functions, laughter might come with an altered pattern of predictors (e.g., a higher predictive value of social personality traits) compared to the current study.
In order to obtain explicit preferences, entertainment platforms sometimes ask their consumers to describe what kind of content they would like to see.Our investigations show that this method works on occasion.Participant's explicit attitudes (e.g., whether they like cheesy pick-up lines, or jokes about people's appearance) were predictive of their later humor appreciation.However, these topic-specific questions did not perform consistently and would require effortful questionnaire work by users who want to be entertained.Similarly, correlations between personality/demographics and humor appreciation from the humor literature were largely replicated, but offer little incremental value in terms of prediction power.For instance, gender, age, and the Big 5, do show face-valid relationships (e.g., agreeable people disliking some aggressive jokes), however these relationships were generally weak and inconsequential for model performance.Thus, humor research on these variables continues to be insightful, but more so for theory development than practical applications.
We replicated the finding that people's rating aggregates were more predictive than the stimuli's rating aggregates 27 .However, the current study relativizes the previous authors' advice to prioritize rater characteristics when predicting humor appreciation.Specifically, if we accepted the predictive value of prominent psychological constructs as relevant, one would be well advised to consider the success rate of the joke (among other raters) as well, because it outperformed all personality variables.Thus, we would extend the cited recommendation into "focus on audience tendencies, specifically their past ratings, while not discarding average stimulus performance".

Limitations and future directions
A general limitation that can be addressed gradually is the inclusion of additional predictor variables.Past research points towards certain interaction effects that were not included here but nonetheless steer humor appreciation: people's dispositional preference for certain joke structures.An example is conservatism which steers people's preference for a full resolution of incongruity at the end of jokes (as opposed to nonsense jokes 70 ).Similarly, preferences for certain joke formats (e.g., text vs. video) can be tied to rater attributes (e.g., the ability to read) and can, depending on the stimuli and rater sample, become informative.
Additionally, we assume that additional sources of predictors, such as attributes of the joke teller or the relationship between joke teller and receiver can offer incremental gains.In past research, social cues and interpersonal perception affected the appreciation of the joke substantially (e.g., a halo effect of attractiveness 73 ; gender dynamics between teller and listener 74 ).To make this point clear, if you, the reader, feel like you could outperform the current models in predicting which of your contacts would laugh at your joke, you might rely on social cues that were not part of the current study.For instance, you are aware of the status dynamics at play when you speak to your personal contacts, and these affect humor responses significantly 75 .It is challenging to include such predictors in large-scale studies as the source of the joke has to be varied alongside the stimuli and rater characteristics.Here, we concentrated on performance humor, meaning jokes and images that can be implemented across social situations (e.g., content produced for online consumption).We did not include spontaneous humor that is afforded by specific social interactions, thus foregoing the predictive power of interpersonal predictor variables but also reducing sample size requirements, which were already substantial due to the highly-parametrized models we used.We assume that large web-scraping jobs, potentially including video recordings of social exchanges, alongside detailed annotations of the humor instances, could function to include such social and situational variables as predictors of humor appreciation.
We want to highlight the many analyses that can be conducted with the newly generated datasets.Close to 11.000 participants from multiple countries rated the humor stimuli.The dataset includes annotations of many demographic and psychological variables, allowing to re-test many statistical associations proposed in the literature.For some analyses, like the correlation between aggressiveness and funniness ratings, it might be reasonable to reformat and aggregate responses across either stimuli or subjects 42 .Data quality appears to be satisfactory (based on face-valid inter-item and inter-construct correlation, as well as successful attention checks, and sensitivity analyses), however it remains worthwhile to consider in which way measurement noise could have influenced our results.For instance, being able to measure predictors with higher reliability (e.g., moral internalization is assumed to be temporally unstable 49 ) can increase their predictive power.People aiming to improve the performance of the prediction algorithms presented here, can extend the supplementary scripts through, for instance, advanced feature engineering.

Conclusion
Humor appreciation is predictable.Variables like a person's amusement propensity, joke popularity, and specific interactions between humor and rater characteristics allowed for prediction accuracies close to the suspected maximum.Main effects of broad psychological constructs were not useful to improve predictions.Humor research can be useful by forwarding the optimal range of predictors to content platforms.Field studies on such platforms would constitute a natural follow-up to the current work.Other research fields can apply a similar, prediction-focused approach to identify which of their studies and theories provide practical value. https://doi.org/10.1038/s41598-023-45935-1

Figure 1 .
Figure 1.Top left: Proportion of 'Good joke' ratings for each stimulus including 95% confidence intervals (binary scale).Top right: Average funniness ratings for each stimulus including 95% confidence intervals (7-point scale).Bottom left: Each participant made ten binary 'good vs bad joke' decisions averaged into a personal proportion score (0-1).The histogram shows these proportion scores for all participants.Bottom right: Each participant provided ten continuous funniness rating averaged into a personal rating score (1-7).The histogram shows these average rating scores for all participants.

Figure 2 .
Figure 2. Drops in prediction performance when a given predictor variable is scrambled.

Figure 3 .
Figure 3. Statistical associations (Pearson correlations and Cohen's D's) between predictors and continuous humor appreciation scores.The content of the individual humor stimuli (boxes 1-10) can be reviewed in the Supplementary materials.

Figure 4 .
Figure 4. Top left: Proportion of 'Good joke' ratings for each stimulus including 95% confidence intervals (binary scale).Top right: Average funniness ratings for each stimulus including 95% confidence intervals (7-point scale).Bottom left: Each participant made ten binary 'good vs bad joke' decisions averaged into a personal proportion score (0-1).The histogram shows these proportion scores for all participants.Bottom right: Each participant provided ten continuous funniness rating averaged into a personal rating score (1-7).The histogram shows these average rating scores for all participants.

Figure 5 .
Figure 5. Out-of-sample prediction accuracies achieved with the three different predictor sets.The dashed line highlights for each accuracy metric which score would be achieved by simply always guessing the mean/ majority class.The width of the points is wider than their respective 95% confidence intervals.The dotted line represents the retest-reliability of the outcome variable.

Figure 6 .
Figure 6.Drops in prediction performance when a given predictor variable is scrambled.

Figure 7 .
Figure 7. Statistical associations (Pearson correlations and Cohen's D's) between predictors and continuous humor appreciation scores.The content of the individual stimuli (boxes 1-10) can be reviewed in the Supplementary materials.

Table 1 .
Example stimuli.Text formatting was simplified for the table.
Category StimulusAffiliative "Knock, knock.Who's there?Olive.Olive, who? Olive you, and I don't care who knows it" Self-defeating "It's true that I'm CUTE: C(ringy), U(nattractive), T(rash), and E(asy to forget)" Aggressive "I'm not saying I hate you, but I would unplug your life support to charge my phone" Self-enhancing "Can a kangaroo jump higher than the Empire State Building?Of course.The Empire State Building can't jump" Sexual "What's the difference between a G-spot and a golf ball?A man will actually search for a golf ball" Vol.:(0123456789) Scientific Reports | (2023) 13:19035 | https://doi.org/10.1038/s41598-023-45935-1

Table 2 .
Accuracy achieved with different predictor sets.AUC Area under the receiver operating characteristic curve; 95% confidence intervals only include shifts by maximum 0.01 for all metrics.

Table 3 .
Example stimuli.Text formatting was simplified for the table.